Abstract

Web usage mining attempts to discover useful knowledge from the secondary data obtained from
the interactions of users with the Web. It has become critical for effective Web site
management, creating adaptive Web sites, business and support services, personalization,
network traffic flow analysis, and so on. A previous study on Web usage mining using a
concurrent clustering, neural-based approach has shown that usage trend analysis depends
heavily on the performance of the clustering of the number of requests. In this paper the
Self-Organizing Map (SOM), a kind of neural network, is introduced into the process of Web
usage mining to detect user patterns. The paper details the transformations necessary to
convert the data stored in Web server log files into input for the SOM.

Introduction

Web mining is the application of data mining techniques to extract knowledge from Web data 
including Web documents, hyperlinks between documents, usage logs of Web sites, etc. A panel
organized at ICTAI 1997 asked the question "Is there anything distinct about Web mining 
(compared to data mining in general)?" While no definitive conclusions were reached then, the 
tremendous attention on Web mining in the past five years, and a number of significant ideas that 
have been developed, have answered this question in the affirmative in a big way. In addition, a 
fairly stable community of researchers interested in the area has been formed, largely through the 
successful series of WebKDD workshops, which have been held annually in conjunction with the 
ACM SIGKDD Conference since 1999, and the Web Analytics workshops, which have been 
held in conjunction with the SIAM data mining conference [1]. 

Web Mining can be broadly divided into three distinct categories, according to the kinds of data 
to be mined: 

Web content mining: Web Content Mining is the process of extracting useful information from 
the contents of Web documents. Content data corresponds to the collection of facts a Web page 
was designed to convey to the users. It may consist of text, images, audio, video, or structured 
records such as lists and tables. Text mining and its application to Web content has been the most 
widely researched. Some of the research issues addressed in text mining are topic discovery,
extracting association patterns, clustering of web documents and classification of Web Pages. 
Research activities in this field also involve using techniques from other disciplines such as 
Information Retrieval (IR) and Natural Language Processing (NLP). While there exists a 
significant body of work in extracting knowledge from images - in the fields of image 
processing and computer vision - the application of these techniques to Web content mining has 
not been very rapid. 

Web structure mining: The structure of a typical Web graph consists of Web pages as nodes and
hyperlinks as edges connecting related pages. Web Structure Mining can be regarded as the
process of discovering structure information from the Web.

Web usage mining: Web Usage Mining is the application of data mining techniques to discover
interesting usage patterns from Web data, in order to understand and better serve the needs of
Web-based applications. Usage data captures the identity or origin of Web users along with their
browsing behavior at a Web site. The usage data can also be split into three different kinds on the
basis of the source of its collection: the server side, the client side, and the proxy side. The
key issue is that on the server side there is an aggregate picture of the usage of a service by all
users, while on the client side there is a complete picture of the usage of all services by a
particular client, with the proxy side being somewhere in the middle.

Web usage mining

The aim of Web usage mining is to discover patterns of user activities in order to better serve the
needs of the users, for example by dynamic link handling or by page recommendation. The aim
of a Web site or Web portal is to supply users with the information that is useful to them. There
is great competition between commercial portals and Web sites because every user eventually
means money (through advertisements, etc.). Thus the goal of each portal owner is to make the
site more attractive to users. For this reason the response time of each site has to be kept
below 2 s. Moreover, some extras have to be provided, such as supplying dynamic content or
links, or recommending pages that are likely to be of interest to the given user. Clustering of
the user activities stored in different types of log files is a key issue in the Web community.

There are three types of log files that can be used for Web usage mining: log files are stored on
the server side, on the client side, and on proxy servers. Having more than one place where the
navigation patterns of users are stored makes the mining process more difficult. Truly reliable
results can be obtained only with data from all three types of log file, because the server side
does not contain records of Web page accesses that are served from caches on the proxy servers
or on the client side. The log file on the proxy server provides additional information beyond
that on the server, but the page requests stored on the client side are still missing, and it is
problematic to collect all the information from the client side. Thus, most algorithms work only
on server-side data [2]. Web usage mining consists of four main steps:
(i) Data collection
(ii) Preprocessing
(iii) Pattern discovery
(iv) Pattern analysis

Figure 1: Block diagram of the Web usage mining process.

In the preprocessing phase the data have to be collected from the different places where they
are stored (client side, server side, proxy servers). After the users have been identified, the
click-stream of each user has to be split into sessions. In general the timeout for determining a
session is set to 30 minutes.

The pattern discovery phase means applying data mining techniques to the preprocessed log
data. These can be frequent pattern mining, association rule mining, or clustering. In this paper
we deal only with the task of clustering Web usage logs. In Web usage mining there are two types
of clusters to be discovered: usage clusters and page clusters. The aim of clustering users is to
establish groups of users with similar browsing behavior. Users can be clustered on the basis of
several kinds of information. On the one hand, users can be asked to fill out a form regarding
their interests, for example when registering on the Web portal, and the clustering can then be
based on the forms. On the other hand, the clustering can be based on the information gained
from the log data collected while the user was navigating through the portal. Different types of
user data can be collected using these methods, for example (i) characteristics of the user (age,
gender, etc.), (ii) preferences and interests of the user, and (iii) the user's behavior pattern.
The aim of clustering Web pages is to obtain groups of pages that have similar content. This
information can be useful for search engines or for applications that create dynamic index pages.

The last step of the whole Web usage mining process is to analyze the patterns found during the
pattern discovery step, that is, to understand the patterns detected in the previous step. The
most common technique is data visualization with filters applied.

Artificial neural networks (ANNs) try to simulate the activity of the human brain. An ANN can
abstract from data and work with incomplete or erroneous data; it holds knowledge and can
adapt it; and it can operate in real time. An ANN is built from common units called neurons.
These processing units are interconnected, and each neuron has its own activation threshold.
Learning in an ANN takes place through the adjustment of the activation threshold of each neuron.

Problem statement 
Web log structure 

The log files are text files that can range in size from 1 KB to 100 MB, depending on the traffic
at a given website. The dataset of the "Government College of Engineering and Technology,
Amravati" is used for the process. This data contains a record of user interactions with the
college website (originally http://www.gcoea.ac.in).

117.203.75.190 - - [10/Sep/2012:18:01:43 +0530] "GET /prajwalan2012/css/events.css
HTTP/1.1" 200 1085 "http://www.gcoea.ac.in/prajwalan2012/" "Mozilla/5.0 (Windows NT 6.1)
AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.861.0 Safari/535.2"

The fields in this dataset are:

IP address: "117.203.75.190"

This is the IP address of the machine that contacted the site.

Username etc.: "- -"

Only relevant when accessing password-protected content.

Timestamp: "[10/Sep/2012:18:01:43 +0530]"

Timestamp of the visit as seen by the Web server.

Access request: "GET /prajwalan2012/css/events.css HTTP/1.1"

The request made. In this case it was a "GET" request (i.e. "show me the page") for the file
"/prajwalan2012/css/events.css" using the "HTTP/1.1" protocol. A "HEAD" request fetches only
the document header, and is the Web equivalent of a "ping" to check that a page is still there
and hasn't changed.

Result status code: "200"

The resulting status code. "200" is success. If the requested URL didn't exist, this is where the
dreaded "404" would have shown up in the log.

Bytes transferred: "1085"

The number of bytes transferred. If this matches the size of the requested file, the download
was successful. If the number is less, that would indicate a failed or partial download.
Some user agents can fetch files a bit at a time. Each bit will show up as a separate line in the
log file, so a series of "hits" whose total adds up to, or exceeds, the file size could indicate a
successful download. On the other hand, it could indicate someone having trouble connecting to
the site who has to keep reconnecting.

Referrer URL: "http://www.gcoea.ac.in/prajwalan2012/"

The referring page. Not all user agents supply this information. This is the page the visitor was
on when they clicked to come to this page. Sometimes it is simply the page the user was looking
at when they typed an address into their browser, or clicked on the address in some other
software such as a newsreader or an email client. This information is very useful to webmasters,
as it allows them to measure which sites are driving traffic to their site. It also represents a
small loss of privacy, as it reveals where visitors are coming from.

User agent: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) 
Chrome/15.0.861.0 Safari/535.2" 

The "User Agent" identifier. The user agent is whatever software the visitor used to access the
site. It's usually a browser, but it could equally be a Web robot, a link checker, an FTP client, or
an offline browser. The "user agent" string is set by the software manufacturer and can be
anything they choose. In this case "Chrome/15.0.861.0" indicates Chrome 15 and "Windows NT
6.1" indicates Windows 7. Well-behaved Web bots and spiders will usually use this string to
identify themselves, their Web site, and an email address.
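
The field layout described above can be sketched as a small parser. This is only an illustration in Python using a regular expression, not the tokenizer-based extraction used in this work; the names LOG_PATTERN and parse_log_line are hypothetical, and the pattern assumes every line follows the layout of the sample entry.

```python
import re
from datetime import datetime

# One named group per field of the sample entry (hypothetical helper, not the paper's code)
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the textual fields that are naturally typed values
    record["timestamp"] = datetime.strptime(record["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = (
    '117.203.75.190 - - [10/Sep/2012:18:01:43 +0530] '
    '"GET /prajwalan2012/css/events.css HTTP/1.1" 200 1085 '
    '"http://www.gcoea.ac.in/prajwalan2012/" '
    '"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) '
    'Chrome/15.0.861.0 Safari/535.2"'
)
record = parse_log_line(sample)
print(record["ip"], record["url"], record["status"], record["size"])
```

Returning None for lines that do not match lets malformed entries be counted or skipped separately instead of stopping the whole extraction.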
Data preparation 

There are some important technical issues that must be taken into consideration during this phase
in the context of the Web personalization process, because Web log data must be prepared and
preprocessed before they can be used in the subsequent phases of the process [5].

The data preprocessing consists of the following steps:

Data field extraction: The log file is read character by character up to the end, and then, using
the methods of the String Tokenizer class, the data fields are broken into tokens and saved in an
array. Before the data are stored, we need to create a table, named the log table, in which each
entry from the original log file is stored. In total, 4210 records are stored in the database.

Data cleaning: Along with the important entries, a Web log file may contain undesirable or
useless data that have nothing to do with the mining procedure. Data cleaning is concerned
with removing all data tracked in Web logs that are useless for mining purposes, e.g. requests
for graphical and other embedded page content (jpg, jpeg, gif, js, css, swf, avi, mov, etc.),
requests for any other file that might be included in a Web page, and navigation sessions
performed by robots and Web spiders. Robot and Web spider navigation patterns must be
explicitly identified. After the cleaning process the number of records is reduced from 4210 to
2710, about 64% of the original.
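
As an illustration of the cleaning step, the following Python sketch drops requests for embedded resources and requests from robots. The extension list is taken from the text; the robot keywords, the robots.txt check, and the sample records are assumptions made for the example, since the text does not state how robots were identified.

```python
from urllib.parse import urlparse

# Extensions listed in the text; requests for these files are not page views
RESOURCE_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".js", ".css", ".swf", ".avi", ".mov")
# Assumed heuristic: substrings that commonly appear in robot user agent strings
ROBOT_KEYWORDS = ("bot", "spider", "crawler")

def is_relevant(url, agent):
    """Return True if a request should be kept for mining."""
    path = urlparse(url).path.lower()   # ignore any query string
    if path.endswith(RESOURCE_EXTENSIONS):
        return False
    if path == "/robots.txt":           # assumed: only robots ask for this file
        return False
    agent = agent.lower()
    return not any(keyword in agent for keyword in ROBOT_KEYWORDS)

def clean(records):
    """Keep only the records that pass is_relevant."""
    return [r for r in records if is_relevant(r["url"], r["agent"])]

# Hypothetical records for illustration
records = [
    {"url": "/index.html", "agent": "Mozilla/5.0 (Windows NT 6.1)"},
    {"url": "/prajwalan2012/css/events.css", "agent": "Mozilla/5.0 (Windows NT 6.1)"},
    {"url": "/about.html", "agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
]
print(len(clean(records)), "of", len(records), "records kept")
```

Keyword matching on the user agent only catches robots that identify themselves, so in practice it would need to be combined with other evidence.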

User identification: Each user is uniquely identified by the combination of IP address and user
agent (e.g. 1.2.3.4 + IE5; Win2k). There are 175 unique users in the 2710 records.
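
A minimal Python sketch of this step is given below. It assumes that the identifying key is the pair (IP address, user agent string), as the example "1.2.3.4 + IE5; Win2k" suggests; the function name and the sample records are hypothetical.

```python
def identify_users(records):
    """Assign a numeric user id to each record, keyed on (IP address, user agent)."""
    user_ids = {}
    assigned = []
    for record in records:
        key = (record["ip"], record["agent"])
        if key not in user_ids:
            user_ids[key] = len(user_ids)   # next unused id
        assigned.append(user_ids[key])
    return assigned, len(user_ids)

# Hypothetical records: two requests from one browser, one from another browser on the same IP
records = [
    {"ip": "1.2.3.4", "agent": "IE5; Win2k"},
    {"ip": "1.2.3.4", "agent": "IE5; Win2k"},
    {"ip": "1.2.3.4", "agent": "Chrome/15.0.861.0"},
]
ids, n_users = identify_users(records)
print(ids, n_users)
```

Two people behind the same proxy with the same browser would still be merged into one user under this key, which is a known limitation of server-side identification.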

User session: After identifying the users, we need to identify the sessions by dividing the
accesses of each user into sessions. It is difficult to detect when one session finishes and
another starts. A common way to detect sessions is to use the time between requests: if two
requests are made within a given time frame, we can assume that they belong to the same
session; otherwise, we consider them to belong to two different sessions. A good time frame
is 30 minutes.
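
The timeout rule can be sketched in Python as follows. The sketch assumes that the 30-minute frame is measured between consecutive requests of the same user; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def split_sessions(timestamps, timeout=SESSION_TIMEOUT):
    """Split one user's request times into sessions (lists of timestamps)."""
    sessions = []
    for ts in sorted(timestamps):
        # Same session if the gap to the previous request is within the timeout
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Hypothetical request times of a single user
times = [
    datetime(2012, 9, 10, 18, 0),
    datetime(2012, 9, 10, 18, 10),
    datetime(2012, 9, 10, 18, 45),   # 35 minutes after the previous request
    datetime(2012, 9, 10, 19, 0),
]
sessions = split_sessions(times)
print([len(s) for s in sessions])
```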



IV. Solution proposed 

Given the preprocessing steps outlined above, for the rest of this implementation we assume that
there is a set of n unique URLs $U = \{url_1, url_2, \ldots, url_n\}$ appearing in the preprocessed
log, and a set of m user transactions $T = \{t_1, t_2, \ldots, t_m\}$, where each $t_i \in T$ is a
non-empty subset of $U$. Each transaction $t$ is represented as an n-dimensional vector, where
$w(url_i, t)$ denotes the weight of $url_i$ in transaction $t$:

$$t = \langle w(url_1, t), w(url_2, t), \ldots, w(url_n, t) \rangle$$
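
The representation of a transaction as a vector over U can be illustrated with a short Python sketch. It assumes binary weights (1 if the URL occurs in the transaction, 0 otherwise), which is only one possible choice since the weighting is not specified here; the URLs and transactions are hypothetical.

```python
def transaction_vector(transaction, urls):
    """Map a transaction (a set of URLs) to a 0/1 vector over the ordered URL list."""
    return [1 if url in transaction else 0 for url in urls]

# Hypothetical U and T for illustration
urls = ["/index.html", "/staff.html", "/departments.html"]
transactions = [{"/index.html", "/staff.html"}, {"/departments.html"}]

vectors = [transaction_vector(t, urls) for t in transactions]
print(vectors)
```

Each resulting vector has one component per URL in U, so all transactions share the same dimension and can be fed directly to a clustering algorithm.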
The pattern discovery phase means applying data mining techniques to the preprocessed log
data. In this paper clustering is performed, and two algorithms are used for it: k-Means and the
Self-Organizing Map (SOM). The artificial neural network, in this case the SOM, has an arbitrary
but predefined number of input neurons, chosen as the number of the most common pages in
the site; consequently, each site will probably have a different neural network architecture.
The output of the SOM is a map of M x N dimensions; the user configures N, which is the
number of clusters the user wants to obtain. For each input only one output cluster is
activated. The same input pattern always activates the same output cluster, and similar inputs
activate nearby output clusters.
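
To make the behavior of the SOM concrete, the following is a minimal Python sketch, not the implementation used in this work. It assumes a 1 x N map, Euclidean distance, a Gaussian neighbourhood, and linearly decaying learning rate and radius; all function names, parameter values, and sample vectors are illustrative assumptions.

```python
import math
import random

def nearest(weights, x):
    """Index of the best-matching unit: the node whose weights are closest to x."""
    return min(range(len(weights)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)))

def train_som(data, n_clusters, epochs=100, lr0=0.5, seed=0):
    """Train a 1 x n_clusters map on equal-length input vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_clusters)]
    sigma0 = max(n_clusters / 2.0, 1.0)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.01)    # neighbourhood radius shrinks
        for x in rng.sample(data, len(data)):       # present inputs in random order
            bmu = nearest(weights, x)
            for j in range(n_clusters):
                # Nodes close to the winner on the map are pulled more strongly
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                weights[j] = [w + lr * h * (v - w) for w, v in zip(weights[j], x)]
    return weights

# Hypothetical session vectors over four pages: two distinct browsing patterns
group_a = [1, 1, 0, 0]
group_b = [0, 0, 1, 1]
data = [group_a] * 3 + [group_b] * 3

weights = train_som(data, n_clusters=3)
print("cluster of pattern A:", nearest(weights, group_a))
print("cluster of pattern B:", nearest(weights, group_b))
```

After training, each session is assigned to the single node returned by nearest, which corresponds to the one output cluster that is activated for that input.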
V. Result

In this paper we compare both methods, SOM and K-Means, on the gcoea.ac.in site, for which
the complete Web usage mining process has been developed. The website contains information
about the college, students, departments, staff, etc. The number of output clusters is
initialized to 3. Figures 2 and 3 show the clusters generated with both methods.




Fig. 2. Percentage of sessions in each cluster with K-Means

Fig. 3. Percentage of sessions in each cluster with SOM

VI. Conclusion 

We can conclude that, for identifying common patterns on the Web, the self-organizing map
(SOM) is better than K-Means. SOM produces a better grouping of users, whereas K-Means
yields little information about users' habits. On the other hand, SOM builds some clusters
containing a large number of sessions from the same users, while K-Means gives a better
distribution of user sessions across the groups.